Add Buildkite pipeline for AI E2E tests (simulator-llm-pilot gem) #25444

Open
iangmaia wants to merge 12 commits into trunk from iangmaia/ci-ai-e2e-tests-gem

Conversation

@iangmaia
Contributor

@iangmaia iangmaia commented Mar 24, 2026

Summary

  • Adds a Buildkite command script and pipeline step for running AI E2E tests using the simulator-llm-pilot gem
  • Checks for "Testing" label on PR (skips if missing to save CI resources)
  • Downloads build artifacts, installs app on simulator, installs the gem from GitHub, runs tests
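
The label gate in the second bullet could look roughly like the sketch below. It assumes the label list comes from Buildkite's `BUILDKITE_PULL_REQUEST_LABELS` environment variable (a comma-separated list); the script in this PR may read labels another way, and `has_label` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Returns 0 when the comma-separated label list ($1) contains the exact label ($2).
# Hypothetical helper, not taken from the PR.
has_label() {
  local labels="$1" wanted="$2" label
  local -a items
  # An empty list never matches (also avoids expanding an empty array).
  [[ -n "$labels" ]] || return 1
  IFS=',' read -r -a items <<< "$labels"
  for label in "${items[@]}"; do
    # Trim leading and trailing whitespace before comparing.
    label="${label#"${label%%[![:space:]]*}"}"
    label="${label%"${label##*[![:space:]]}"}"
    if [[ "$label" == "$wanted" ]]; then
      return 0
    fi
  done
  return 1
}

# Possible use at the top of a CI step (assumption: labels are exposed by Buildkite):
#   if ! has_label "${BUILDKITE_PULL_REQUEST_LABELS:-}" "Testing"; then
#     echo "No 'Testing' label on this PR, skipping AI E2E tests."
#     exit 0
#   fi
```

Matching whole labels rather than substrings keeps a label such as "Needs Testing" from triggering the step by accident.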

The gem handles everything internally: simulator detection, WDA lifecycle, agent loop with sandboxed tools, context window compression, verification/cleanup enforcement, and structured results.

Alternative approach: see #25443 for a Claude Code + wrapper scripts version of the same pipeline.

Ref: AINFRA-2176

Test plan

  • Run .buildkite/commands/run-ai-e2e-tests.sh locally with a booted simulator and test site credentials
  • Run a simple test case (users-screen-loads.md) end-to-end
  • Verify results.md is written with correct pass/fail status

🤖 Generated with Claude Code

@dangermattic
Collaborator

1 Message
📖 This PR is still a Draft: some checks will be skipped.

Generated by 🚫 Danger

@wpmobilebot
Contributor

wpmobilebot commented Mar 24, 2026

📲 You can test the changes from this Pull Request in WordPress by scanning the QR code below to install the corresponding build.
App Name: WordPress
Configuration: Release-Alpha
Build Number: 32185
Version: PR #25444
Bundle ID: org.wordpress.alpha
Commit: 146daa1
Installation URL: 7l99o9dpqifu8
Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

@wpmobilebot
Contributor

wpmobilebot commented Mar 24, 2026

📲 You can test the changes from this Pull Request in Jetpack by scanning the QR code below to install the corresponding build.
App Name: Jetpack
Configuration: Release-Alpha
Build Number: 32185
Version: PR #25444
Bundle ID: com.jetpack.alpha
Commit: 146daa1
Installation URL: 04hfgpqarjii0
Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

@iangmaia iangmaia self-assigned this Mar 24, 2026
@iangmaia iangmaia added the Testing Unit and UI Tests and Tooling label Mar 25, 2026
@iangmaia iangmaia force-pushed the iangmaia/ci-ai-e2e-tests-gem branch 2 times, most recently from 1602fa9 to 8589139 Compare March 30, 2026 17:25

@crazytonyli
Contributor

Hi @iangmaia, shall we land this and start running nightly jobs?

iangmaia and others added 12 commits May 8, 2026 17:16
The gem provides a sandboxed agent that drives the simulator through a
fixed set of tools (tap, swipe, type, REST API) with no arbitrary code
execution. It handles WDA lifecycle, session management, context window
compression, and verification/cleanup enforcement internally.

The Buildkite step:
- Checks for "Testing" label (skips if missing)
- Downloads build artifacts and installs app on simulator
- Installs the simulator-llm-pilot gem from GitHub
- Runs all test cases in Tests/AgentTests/ui-tests/

Ref: AINFRA-2176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gem build resolves spec file paths relative to cwd, so
bin/simulator-llm-pilot wasn't found when building from the
wordpress-ios repo root.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
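
As the commit message above notes, `gem build` resolves the gemspec's file paths relative to the current directory. A minimal sketch of one way to handle that is to run the build from inside the gem's checkout in a subshell. `build_gem_from` is a hypothetical helper name, and the actual fix in this PR may differ.

```shell
#!/usr/bin/env bash

# Runs `gem build` from inside the gem's source directory so that relative
# paths in the gemspec (for example bin/simulator-llm-pilot) resolve correctly.
# $1 = gem source directory, $2 = gemspec file name inside that directory.
build_gem_from() {
  local source_dir="$1" gemspec="$2"
  # The subshell keeps the caller's working directory unchanged.
  ( cd "$source_dir" && gem build "$gemspec" )
}

# Possible use (the path variable is illustrative only):
#   build_gem_from "$build_dir" simulator-llm-pilot.gemspec
```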
Extract WDA build to a separate build-wda.sh script for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gem no longer hardcodes WordPress login flow in its system prompt.
Add app-instructions.md with the WordPress/Jetpack login flow and pass
it via --app-instructions-file. Also pass --app-name so the LLM knows
the app's display name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
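
A sketch of how the two flags named in the commit message above might be assembled before invoking the gem. Only `--app-instructions-file` and `--app-name` come from the PR; `pilot_app_args` is a hypothetical helper, and any other arguments the real script passes are omitted because they are not shown here.

```shell
#!/usr/bin/env bash

# Prints the app-specific CLI arguments, one per line, after checking that
# the instructions file exists. Hypothetical helper, not taken from the PR.
# $1 = path to the app instructions file, $2 = the app's display name.
pilot_app_args() {
  local instructions_file="$1" app_name="$2"
  if [[ ! -f "$instructions_file" ]]; then
    echo "Missing app instructions file: $instructions_file" >&2
    return 1
  fi
  printf '%s\n' --app-instructions-file "$instructions_file" --app-name "$app_name"
}

# Possible use:
#   pilot_app_args Tests/AgentTests/app-instructions.md "Jetpack"
```

Failing early on a missing file gives a clearer CI error than letting the run start without the login instructions.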
@iangmaia iangmaia force-pushed the iangmaia/ci-ai-e2e-tests-gem branch from efadbc8 to 146daa1 Compare May 8, 2026 15:16
@iangmaia iangmaia marked this pull request as ready for review May 8, 2026 15:17
Copilot AI review requested due to automatic review settings May 8, 2026 15:18
@iangmaia
Contributor Author

iangmaia commented May 8, 2026

@crazytonyli Hey Tony! With all the recent changes and updates this got left behind 😓 sorry about that. There's not much work left to start running it and iterating on the tests IMO, so that's the good side.
There are a couple of open questions in paaHJt-9Te-p2 related to the tests themselves (one of them always failed iinm).

As mentioned in the P2, I think that this PR + simulator-llm-pilot is the way to go for E2E AI tests, but it would be nice to make it fully 🟢 to start with.

Contributor

Copilot AI left a comment

Pull request overview

Adds a Buildkite CI step to run AI-driven end-to-end UI tests on an iOS Simulator using the simulator-llm-pilot gem, including helper scripts for installing the gem, locating/booting a simulator, and building WebDriverAgent.

Changes:

  • Adds a new Buildkite pipeline step (PR-only) to run AI E2E tests and upload Tests/AgentTests/results/** artifacts.
  • Introduces CI scripts to install simulator-llm-pilot, find a booted simulator, build WebDriverAgent, install the app, and run the test suite.
  • Updates AI test/navigation skill docs and adds app login instructions used by the test runner.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

File | Description
Tests/AgentTests/app-instructions.md | Adds login-flow instructions for the agent-runner to avoid unsafe/manual credential entry.
Scripts/ci/install-simulator-llm-pilot.sh | Installs simulator-llm-pilot by building from a local checkout or cloning from GitHub.
Scripts/ci/find-booted-simulator.rb | Helper to return a booted simulator UDID (optionally waiting/polling).
.claude/skills/ios-sim-navigation/SKILL.md | Aligns documentation placeholder naming (<APP_BUNDLE_ID>).
.claude/skills/ai-test-runner/SKILL.md | Aligns documentation placeholder naming (<APP_BUNDLE_ID>).
.buildkite/pipeline.yml | Adds a new “AI E2E Tests” Buildkite step gated to PR builds.
.buildkite/commands/run-ai-e2e-tests.sh | Orchestrates artifact download, simulator/app setup, WDA build, and simulator-llm-pilot run.
.buildkite/commands/build-wda.sh | Clones/builds WebDriverAgent and skips rebuild when artifacts already exist.


Comment on lines +8 to +32
    SIMULATOR_LLM_PILOT_REPO_URL="${SIMULATOR_LLM_PILOT_REPO_URL:-https://github.com/Automattic/simulator-llm-pilot.git}"
    SIMULATOR_LLM_PILOT_SOURCE_PATH="${SIMULATOR_LLM_PILOT_SOURCE_PATH:-}"

    build_dir="$(mktemp -d)"
    trap 'rm -rf "$build_dir"' EXIT

    source_path="${SIMULATOR_LLM_PILOT_SOURCE_PATH}"
    if [[ -z "$source_path" && -f "${DEFAULT_LOCAL_GEM_PATH}/simulator-llm-pilot.gemspec" ]]; then
      source_path="${DEFAULT_LOCAL_GEM_PATH}"
    fi

    if [[ -n "$source_path" ]]; then
      echo "Using local simulator-llm-pilot source at ${source_path}"
      if [[ -d "${source_path}/.git" ]]; then
        source_revision="$(git -C "${source_path}" rev-parse HEAD)"
        git -C "${source_path}" archive HEAD | tar -x -C "$build_dir"
      else
        source_revision="local-filesystem"
        tar -cf - -C "${source_path}" . | tar -xf - -C "$build_dir"
      fi
    else
      echo "Cloning simulator-llm-pilot from ${SIMULATOR_LLM_PILOT_REPO_URL}"
      git clone --depth 1 "${SIMULATOR_LLM_PILOT_REPO_URL}" "$build_dir"
      source_revision="$(git -C "$build_dir" rev-parse HEAD)"
    fi

Comment on lines +24 to +25
    WEBDRIVERAGENT_REPO_URL="${WEBDRIVERAGENT_REPO_URL:-https://github.com/appium/WebDriverAgent.git}"
    WEBDRIVERAGENT_REF="${WEBDRIVERAGENT_REF:-}"

Comment on lines +51 to +55
    ensure_wda_checkout

    if [[ -d "$WDA_PROJECT" ]] && has_built_artifacts; then
      echo "WebDriverAgent already built, skipping."
      exit 0

Comment on lines +126 to +127
    TIMESTAMP="$(date +%Y-%m-%d-%H%M)"
    RESULTS_DIR="Tests/AgentTests/results/${TIMESTAMP}"

Comment on lines +94 to +98
    UDID="$(ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" 2>/dev/null || true)"
    if [[ -z "$UDID" ]]; then
      echo "No booted simulator named '$SIMULATOR_NAME' found. Booting..."
      xcrun simctl boot "$SIMULATOR_NAME" 2>/dev/null || true
      UDID="$(ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" 30 1 2>/dev/null || true)"

Comment on lines +14 to +15
    output, status = Open3.capture2('xcrun', 'simctl', 'list', 'devices', 'booted', '-j')
    exit 1 unless status.success?
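
The boot-then-wait flow in the run-ai-e2e-tests.sh excerpt relies on the Ruby helper to poll for a booted simulator. For illustration only, here is a minimal shell sketch of the same polling idea. `poll_for_output` is a hypothetical function, and reading the helper's `30 1` arguments as "up to 30 tries, 1 second apart" is an assumption rather than something the excerpts confirm.

```shell
#!/usr/bin/env bash

# Re-runs a command until it prints non-empty output or the attempts run out.
# $1 = maximum attempts, $2 = seconds to sleep between attempts,
# remaining arguments = the command to run.
# Prints the command's output and returns 0 on success, returns 1 otherwise.
poll_for_output() {
  local attempts="$1" interval="$2" output i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    # A failing command is treated the same as one that printed nothing.
    output="$("$@" 2>/dev/null || true)"
    if [[ -n "$output" ]]; then
      printf '%s\n' "$output"
      return 0
    fi
    # Skip the sleep after the final attempt.
    if (( i < attempts )); then
      sleep "$interval"
    fi
  done
  return 1
}

# Possible use (single lookup per attempt, polling done in shell):
#   UDID="$(poll_for_output 30 1 ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" || true)"
```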

Labels

[Status] DO NOT MERGE Testing Unit and UI Tests and Tooling

5 participants